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Abstract: Given a positive definite covariance matrix E, we strive to con- 
struct an optimal approximate factor analysis model HH T + D, with H 
having a prescribed number of columns and D > diagonal. The optimality 
criterion we minimize is the I-divergence between the corresponding normal 
laws. Lifting the problem into a properly chosen larger space enables us to 
derive an alternating minimization algorithm a la Csiszar-Tusnady for the 
construction of the best approximation. The convergence properties of the 
algorithm are studied, with special attention given to the case where D is 
singular. 
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1. Introduction 

Factor analysis (FA) , in its original formulation, deals with the linear statistical 
model 



where H is a deterministic matrix, X and e are independent random vectors, 
the first with dimension smaller than Y, the second with independent compo- 
nents. What makes this model attractive in applied research is the data reduction 
mechanism built in it. A large number of observed variables Y are explained in 
terms of a small number of unobserved (latent) variables X perturbed by the 
independent noise e. Under normality assumptions, which are the rule in the 
standard theory, all the laws of the model are specified by covariance matrices. 
More precisely, assume that X and e are zero mean independent normal vectors 



Y = HX + e, 
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with Cov(X) = P and Cov(e) = D, where D is diagonal. It follows from (1.1) 
that Cov(y) = HPH T + D. Since in the present paper, basically only covari- 
ances are considered, the results obtained will also be valid, in a weaker sense, 
in a non Gaussian environment. 

Building a factor analysis model of the observed variables requires the solu- 
tion of a difficult algebraic problem. Given E, the covariance matrix of Y, find 
the triples (H, P, D) such that £ = HPH T + D. As it turns out, the right tools 
to deal with the construction of an exact FA model come from the theory of 
stochastic realization, see Finesso and Picci (1984) for an early contribution on 
the subject. Due to the structural constraint on D, assumed to be diagonal, the 
existence and uniqueness of a FA model are not guaranteed. 

In the present paper we strive to construct an optimal approximate FA model. 
The criterion chosen to evaluate the closeness of covariances is the I-divergence 
between the corresponding normal laws. We propose an algorithm for the con- 
struction of the optimal approximation, inspired by the alternating minimization 
procedure of Csiszar and Tusnady (1984) and Finesso and Spreij (2006). 

The remainder of the paper is organized as follows. The FA model is intro- 
duced in Section 2 and the approximation problem is posed and discussed in 
Section 3. Section 4 recasts the problem as a double minimization in a larger 
space, making it amenable to a solution in terms of alternating minimization. 
It will be seen that both resulting I-divergence minimization problems satisfy 
the so-called Pythagorean rule, guaranteeing the optimality. In Section 5, we 
present the alternating minimization algorithm, provide alternative versions of 
it, and study its asymptotical properties. We also point out, in Section 6, the 
relations and differences between our algorithm and the EM-algorithm for the 
estimation of the parameters of a factor analysis model. Section 7 is dedicated 
to a constrained version of the optimization problem (the singular D case) and 
the pertinent alternating minimization algorithm. The study of the singular case 
also sheds light on the boundary limit points of the algorithm presented in Sec- 
tion 5. In the Appendix we have collected some known properties on matrix 
inversion and I-divergence between normal distributions for easy reference, as 
well as most proofs of the technical results. 

The present paper is a considerably extended version of Finesso and Spreij 
(2007), moreover providing easier proofs of some of the results already contained 
in that reference. 

2. The model 

Consider independent random vectors Z and e, of respective dimensions k and 
n, both normally distributed with zero mean. For simplicity P = Cov(Z) is 
assumed to be invertible. For any n x k matrix L let the random vector Y, of 
dimension n, be defined by 

Y = LZ + e. (2.1) 
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The linear model (2.1), ubiquitous in Statistics, becomes the standard Factor 
Analysis (FA) model under the extra constraints 

k < n, and Cov(e) — D > 0, diagonal. 

In many applications one starts with a given, zero mean normal vector Y , and 
wants to find the parameters P, L, and D of a FA model for Y. The above 
constraints impose a special structure to the covariance of Y, 

Cov(F) = LPL T + D, (2.2) 

which is non generic since k < n and D is diagonal, therefore not all normal 
vectors Y admit a FA model. To elucidate this, consider the joint normal vector 



Z) \I OJ \ £/ 
whose covariance matrix is given by 

Cov«0 = ( LP £t +D f). (2-4) 

The constraints imposed on Cov(V) by the FA model are related to a conditional 
independence property. 

Lemma 2.1. Let Y 6 K. n be a zero mean normal vector, then Cov(y) = 
LPL T + D, for some (L, P, D), with L € R nxk , P > 0, and diagonal D > 
if and only if there exists a k-dimensional zero mean normal vector Z , with 
Cov(Z) = P, such that the components ofY are conditionally independent given 
Z. 

Proof. Assume that Cov(F) = LPL T + D and construct a matrix E as in the 
right hand side of (2.4). Clearly S > 0, since P > and D > 0, and therefore it 
is a bonafide covariance matrix, hence there exists a multivariate normal vector 
V whose covariance matrix is S. Writing V T = (Y T ,Z T ) T for this vector, it 
holds that Cov(Z) = P, moreover Cov(Y\Z) = D (see Equation (A.l)). The 
conditional independence follows, since D is diagonal by assumption. For the 
converse assume there exists a random vector Z as prescribed in the Lemma. 
Then Cov(Y\Z) is diagonal by the assumed conditional independence, while 
E(Y\Z) = LZ for some L, being a linear function of Z. We conclude that 
Cov(F) = Cov(E(Y\Z)) + Cov(y|Z) = LPL T + D as requested. □ 

The above setup is standard in system identification, see Finesso and Picci 
(1984). It is often convenient to give an equivalent reparametrization of model (2.3) 
as follows. Let P — Q T Q, where Q is a k x k square root of P, and define 
X = Q~ T Z. Model (2.3) then becomes 



V 



Y\ _ (LQ 1 A (X 



zl \ Q °J \ £ 
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where Cov(X) = /. The free parameters are now H = LQ T , the diagonal D > 0, 
and the invertible k x k matrix Q. In this paper we will mostly, but not always, 
use the latter parametrization, which will be written directly in terms of the 
newly defined parameters as 

for which 

„ , . (HH T + D HQ \ . . 

C ° v(y) = { (HQY Q T Q ) ' ^ 

Note that, with this parametrization, 

y = HX + e, and Cov(F) = HH T + D. (2.7) 

For simplicity, in the first part of the paper, it will be assumed that H has full 
column rank and D > 0. 



3. Problem statement 

Let Y be an n dimensional, normal vector, with zero mean and E = Cov(y) 
given. As a consequence of Lemma 2.1 it is not always possible to find an ex- 
act FA analysis model (2.3), nor equivalently (2.5), for Y. As it will be proved 
below, one can always find a best approximate FA model. Here 'best' refers to 
optimizing a given criterion of closeness. In this paper we opt for minimizing the 
I-divergence {a.k.a. Kullback-Leibler divergence). Recall that, for given proba- 
bility measures Pi and P2, defined on the same measurable space, and such that 
Pi <C P2, the I-divergence is defined as 

dP 

Z(Pi||P 2 )=E Pl lo gci ^. (3.1) 

In the case of normal laws the I-divergence (3.1) can be explicitly computed. 
Let v\ and ^2 be two normal distributions on K m , both with zero mean, and 
whose covariance matrices, Ei and E 2 respectively, are both non-singular. Then 
the distributions are equivalent and the I-divergence T(y\\\v2) takes the explicit 
form, see Appendix A, 

IHM = ilog HI - y + ^(E^Ei). (3.2) 

Since, because of zero means, the I-divergence only depends on the covariance 
matrices, we usually write X(Ei||E2) instead of T{v\\\v2). Note that Z(Ei||E2), 
computed as in (3.2), can be considered as a I-divergence between two positive 
definite matrices, without referring to normal distributions. Hence the approxi- 
mation Problem 3.1 below, is meaningful also without normality assumptions. 

The problem of constructing an approximate FA model, i.e. of approximating 
a given covariance E £ K nx ™ by HH T + D, can be cast as the following 
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Problem 3.1. Given E > of size n x n and an integer k < n, minimize 

X(E| \HH T + D) = \ log lHH ^ Dl ~ + ^r((i/H T + £>) _1 E), (3.3) 

where the minimum, if it exists, is taken over all diagonal Z? > 0, and H G M. nxk . 

Note that Z(E| \HH T + D) < oo if and only if HH T + D is invertible, which 
will be a standing assumption in all that follows. 

The first result is that a minimum in Problem 3.1 indeed exists. It is formu- 
lated as Proposition 3.2 below, whose proof, requiring results from Section 5, is 
given in Appendix D. 

Proposition 3.2. There exist matrices H* G M. nxk , and nonnegative diagonal 
D* G K" xn , that minimize the I- divergence in Problem 3.1. 

In a statistical setup, the approximation problem has an equivalent formula- 
tion as an estimation problem. One then will have a sequence of idd observations 
Yi, . . . , Yjy, each distributed according to (2.7). The matrices H and D are the 
unknown parameters that have to be estimated, which can be done applying the 
maximum likelihood (ML) method. For big enough N, the sample covariance 
matrix will be positive definite a.s. under the assumption that the covariance 
matrix of the Yi is positive definite. Denote the sample covariance matrix by 
E. The computation of the ML estimators of H and D is equivalent to solving 
the minimization problem 3.1. Indeed the normal log likelihood £(H, D) with H 
and D as parameters yields 

i(H, D) = log(27r) - i log \HH T + D\- ^tr((i/i/ T + D^S) . 

One immediately sees that £(H, D) is, up to constants not depending on H and 
D, equal to -X(S| \HH T + D). Hence, maximum likelihood estimation com- 
pletely parallels I-divergence minimization, only the interpretation is different. 

The equations for the maximum likelihood estimators can be found in e.g. 
Section 14.3.1 of Anderson (1984). In terms of the unknown parameters H and 
D, with D assumed to be non-singular, they are 

iJ = (E — HH T )D~ 1 H (3.4) 
D = A(E — HH T ). (3.5) 

where A(M), defined for any square M, coincides with M on the diagonal 
and is zero elsewhere. Note that the matrix HH T + D obtained by maximum 
likelihood estimation, is automatically invertible. Then it can be verified that 
equation (3.4) is equivalent to 

H = %{HH T +D)- 1 H, (3.6) 

which is also meaningful, when D is not invertible. 
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The maximum likelihood equations (3.4) and (3.5) for the alternative parametriza- 
tion, as induced by (2.3), take the form 

L={t-LPL T )D- 1 L (3.7) 

D = A(£ — LPL T ), (3.8) 

with (3.7) equivalent to 

L = £(LPL T + D)~ 1 L. (3.9) 

It is clear that the system of equations (3.4), (3.5) does not have an explicit 
solution. For this reason numerical algorithms have been devised, among others 
an adapted version of the EM algorithm, see Rubin and Thayer (1982). In the 
present paper we consider an alternative approach and, in Section 5, we compare 
the ensuing algorithm with the EM. 

In Finesso and Sprcij (2006) we considered an approximate nonnegative ma- 
trix factorization problem, where the objective function was also of I-divergence 
type. In that relaxation technique lifted the original minimization to 

a double minimization in a higher dimensional space and led naturally to an 
alternating minimization algorithm. A similar approach, containing the core of 
the present paper, will be followed below. 



4. Lifted version of the problem 

In this section we recast Problem 3.1 in a higher dimensional space, making it 
amenable to solution via two partial minimizations. Later on this approach will 
lead to an alternating minimization algorithm. 

First we introduce two relevant classes of normal distributions. All random 
vectors are supposed to be zero mean and normal, therefore their laws are com- 
pletely specified by covariance matrices. Consider the set £ comprising all the 
(n + fc)-dimensional covariance matrices. An element E G £ can always be 
decomposed as 

Ell Sl2 
S21 £22, 



(4.1) 



where En and £22 are square, of respective sizes n and k. Two subsets of £ will 
play a major role in what follows. The subset So of £, contains the covariances 
that can be written as in (4.1), with En = E, a given matrix, i.e. 

S = {SeS:S n = S}. 

Elements of So will often be denoted by So. Also of interest is the subset £1 of 
£ whose elements are covariances for which the decomposition (4.1) takes the 
special form 

(HH^ + D HQ \ 

(HQ) T Q T Qj' (4 ' 2) 
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for certain matrices H, D, Q with D diagonal, i.e. 

Si = {E G E : 3H, D, Q : E n = HH T + D. S 12 = HQ, E 22 = Q T Q}. 
Elements of E x will be often denoted by E(i7, -D, Q) or by Si. 

In the present section we study the lifted 
Problem 4.1. 

min Z(Eo||Si) 
SoeSo,SitzEi 

viewing it as a double minimization over the variables So an d Si. Problem 4.1 
and Problem 3.1 are related by the following proposition, whose proof is deferred 
to Appendix D. 

Proposition 4.2. Let S be given. It holds that 

minI(S||iJH T +D) = min I(S ||Si). 
H ' D EoeEo.EieEj 

4- 1. Partial minimization problems 

The first partial minimization, required for the solution of Problem 4.1, is as 
follows. 

Problem 4.3. Given a strictly positive definite covariance matrix S £ E, find 

min I(Eo||S). 

The unique solution to this problem can be computed analytically. 
Proposition 4.4. The unique minimizer S* of Problem 4-3 is given by 



S SSj^E 



s* = I " ie i" 11 " 12 ~ _ x ] > o. 

V S 2 iS u S S 22 — S 2 iS 11 (Sii — S)S X1 Si 2 / 

Moreover 

X(S*||S)=I(S||Sn), (4.3) 

and the Pythagorean rule 

Z(So||S) =I(S ||S*)+Z(S*||S) (4.4) 
holds for any strictly positive So G So- 

Proof. See Appendix D. □ 

Remark 4.5. Using the decomposition of Lemma B.l, one can easily compute 
the inverse of the matrix S* of Proposition 4.4 and verify that (S*) _1 differs 
from S _1 only in the upper left block. Moreover, in terms of _L 2 -norms (the L 2 - 
norm of a matrix M is ||M|| = (tr(M T Af)) 1 / 2 ) we have for the approximation 
of the inverse the identity HE" 1 - (S*)" 1 !! = HE^ 1 - E _1 ||. 
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Next we turn to the second partial minimization 
Problem 4.6. Given a strictly positive definite covariance matrix E e E, find 

min I(E||Ei). 

A solution to this problem is given explicitly in the proposition below. To state 
the result we introduce the following notation: for any nonnegative definite P 
denote by P 1 / 2 any matrix satisfying P 1 / 2 p 1 / 2 — p 5 and by P -1 / 2 its inverse, 
if it exists. Furthermore we put E u = En — E 12 E 2 ~ 2 1 E 2 i. 

Proposition 4.7. A minimizer E(iP, D* , Q*) of Problem 4-6 is given by 



— ^22 

H = E19E 



12-^22 



-1/2 



D* = A(E X1 - E 12 E 22 1 E 2 i) 
corresponding to the minimizing matrix 



E* = Z(H*,D*,Q*) = 



^i2E 22 1 S2i + A(En — Ei2E 22 1 E2i) 



E21 E 2 2, 
Moreover, X(E||E*) = I(En||A(Ei X )) and the Pythagorean rule 

I(E||Ei)=X(E||E*)+J(E*||Ei) (4.5) 
holds for any Ei = E(ff, D, Q) € Si. 

Proof. See Appendix D. □ 

Note that this problem cannot have a unique solution in terms of the matrices 
H and Q. Indeed, if U is a unitary k x k matrix and H' = HU, Q' = U T Q, then 
H'H ,T = HH T , Q' T Q' = Q T Q and Pf'Q' = HQ. Nevertheless, the optimal 
matrices HH T , iJQ and Q T Q are unique, as it can be easily checked using the 
expressions in Proposition 4.7. 

Remark 4.8. Note that, since E is supposed to be strictly positive, En — 
Ei 2 E 2 " 2 1 E 2 i > 0. It follows that D* = A(S U - Ei 2 E 2 " 2 1 E 2 i) is strictly positive. 

Remark 4.9. The matrix E* in Proposition 4.7 differs from E only in the 
upper left block and in terms of L -norms we have the identity 1 1 E — E* 1 1 = 
||En — A(En)||, compare with Remark 4.5. 

We close this section by considering a constrained version of the second partial 
minimization Problem 4.6. The constraint that we impose is Q = Qq, where Qq 
is fixed or, slightly more general, with P := Q^Qo fixed. The matrices H and 
D remain free. For clarity we state this as 
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Problem 4.10. Given strictly positive covariances EgS and Pq € R kxk , and 
letting Qq be any matrix satisfying Po = QqQo, find 

min X(E||Ei). 
s(H,r>,Q )eS 1 

The solution is given in the next proposition. 
Proposition 4.11. A solution Eg of Problem 4-10 is given by 

_ /Ei2S 22 1 PoS 22 1 E2i + A(Eii — Ei2E 22 E21) Si 2 E 22 1 Po 

° V P)^22 ^21 Po 

for which H* = E^E^Qj an d D* is as in Proposition 4-7. 

Proof. See Appendix D. □ 

Note that for the constrained problem no Pythagorean rule holds. How- 
ever (4.5) can be used to compare the optimal I-divergences of Problem 4.6 
and Problem 4.10. Since Sq € Si, applying (4.5) one gets 

I(E||S5)=2:(S||E*)+X(S*||ES), 

hence I(S||Sq) > I(S||E*), where E* is as in Proposition 4.7. The quantity 
Z(E* 1 1 Eg) is the extra cost incurred solving Problem 4.10 instead of Problem 4.6. 
An elementary computation gives 

I(E*||ES)=X(E 22 ||Po). 

In fact this is an easy consequence of the relation, similar to Remark 4.5, 



We see that the two optimizing matrices in the constrained case (Proposi- 
tion 4.11) and unconstrained case (Proposition 4.7) coincide iff the constraining 
matrix Pq satisfies Pq = E2 2 - 



5. Alternating minimization algorithm 

In this section, the core of the paper, the two partial minimizations of Section 4 
are combined into an alternating minimization algorithm for the solution of 
Problem 3.1. A number of equivalent formulations of the updating equations 
will be presented and their properties discussed. 



5.1. The algorithm 

We suppose that the given covariance matrix E is strictly positive definite. To 
setup the iterative minimization algorithm, assign initial values H ,D ,Q to 
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the parameters, with D diagonal, Qq invertible and HqHJ + D invertible. 
The updating rules are constructed as follows. Let H t , D t ,Qt be the parameters 
at the t-th iteration, and Si.t = S(i/ t , D t , Qt) the corresponding covariancc, 
defined as in (4.2). Now solve the two partial minimizations as illustrated below. 



(H t ,D t ,Qt) — > S .i — > {Ht+i, D t+ i, Qt+i) ■ ■ ■ , 

min Z(E ||Ei,t) ' min Z(E ,t||Ei) 

where Eo it denotes the solution of the first minimization with input £1,4. 
To express in a compact form the resulting update equations, define 

Rt — I~HJ {H t Hj +D t )^ 1 H t +Hj {H t Hj +D t y 1 T^{H t Hj + D t )~ 1 H t . (5.1) 

Note that, by Remark 4.8, H t Hj + D t is actually invertible for all t, since 
both HqHq + Dq and Qo have been chosen to be invertible. It follows, by 
Corollary B.4, that also / — Hj (H t Hj + D t )~ 1 H t , and consequently Rt, are 
strictly positive and therefore invertible. The update equations resulting from 
the cascade of the two minimizations are 

Q t+1 = (QjRtQt) 12 , (5.2) 



H t+1 = Y,(H t Hj + D t )- l H t Q t Q-^ (5.3) 
D t+1 =A(X-H t+1 Hj +1 ). (5.4) 

Properly choosing the square root in Equation (5.2) makes Qt disappear from 
the update equations. This is an attractive feature since only H t and D t are 
needed to construct the approximate FA model H t Hj + D t at the t-th step of 
the algorithm. Observe that (Qj RtQt) 1 ^ 2 = Rt^ 2 Qt, where R\^ 2 is a symmetric 
root of Rt, is a possible root for the right hand side of Equation (5.2). Inserting 

1 /2 

the resulting matrix Qt+i = R t Qt into Equation (5.3) results in 

Algorithm 5.1. Given H t , D t from the i-th step, and Rt as in (5.1), the update 
equations for a I-divergence minimizing algorithm are 

H t+l = V{H t Hj + Dt)- 1 UtRTt 112 (5.5) 
A+i = A(E-H t+1 Hj +1 ). (5.6) 

Since Rt only depends on H t and D t , see (5.1), the parameter Q t has been 
effectively eliminated. 



5.2. Alternative algorithms 

Algorithm 5.1 has two drawbacks making its implementation computationally 
awkward. To update H t via equation (5.5) one has to compute, at each step, 
the square root of the k x k matrix R t and the inverse of the n x n matrix 
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H t Hj + D t . Taking a slightly different approach it is possible to reorganize the 
algorithm in order to avoid the computation of square roots at each step, and 
to reduce to k x k the size of the matrices that need to be inverted. 

To avoid the computation of square roots at each step there are at least two 
possible variants of Algorithm 5.1, both involving a reparametrization. The first 
approach is to use the alternative parametrization (2.3) and to write update 
equations for the parameters L, D, P. Translated in terms of the matrices L t :— 
H t Q^ T and P t — QjQt, Algorithm 5.1 becomes 

Algorithm 5.2. Given L t , P t , and D t from the t-th step, the update equations 
for a I-divergence minimizing algorithm are 

Lt+i = %{L t P t Lj + D t )" 1 L t P t P t - 1 1 , (5.7) 
Pt+i = Pt-PtLj {L t P t Lj + D t )- 1 (L t P t Lj +D t -Z)(L t P t Lj + D t y 1 L t P u 
Dt+i = A(S — L t+ iP t +iLj +1 ) . 

One can run Algorithm 5.2 for any number T of steps, and then switch back 
to the H, D parametrization computing Ht = LtQt, which requires only the 
square root at iteration T, i.e. Pt = QtQt 

An alternative approach to avoid the square roots at each iteration of Algo- 
rithm 5.1 is to run it for T-Lt '■— H t Hj . 

Proposition 5.3. Let H t be as in Algorithm 5.1. Pick T~Lo = HqHq , and Dq 
such that Hq + D Q is invertible. The update equation for T-L t becomes 

H t+1 = %(Ht + A) _1 K t (A + %{Ht + A) _1 W t ) _1 S. (5.8) 

Proof. From Equation (5.5) one immediately gets 

H t +i = H t+1 Hj +1 = £{H t + Dt^HtR^HjCHt + A) _1 S. (5.9) 

The key step in the proof is an application of the elementary identity 

(I + if T Pff)- 1 ff T = H T (I + PHH T )~ 1 , 

valid for all H and P of appropriate dimensions for which both inverses exist. 
Note that, by Corollary B.3, the two inverses either both exist or both do not 
exist. We have already seen that Rt is invertible and of the type I + HPH T . 
Following this recipe, we compute 

R^Hj = Hj(l - {H t + r> t y x Ht + {n t + D t )- l t{rl t + Dt^Hty 1 
= Hj{{rl t + B t Y x D t + {Ht + DtY^iUt + Dt^UtY 1 
= Hj(D t + S(K t + D t )- 1 Ht)~ 1 QU + D t ). 
Insertion of this result into (5.9) yields (5.8). □ 
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One can run the update Equation (5.9), for any number T of steps, and then 
switch back to Ht, taking any n x k factor of Ht i.e. solve Ht = HtH^. 
Since Equation (5.9) transforms Ht into Ht+i preserving the rank, the latter 
factorization is always possible. 

It is apparent that the second computational issue we mentioned above, concern- 
ing the inversion of n x n matrices at each step, affects also Algorithm 5.2. The 
alternative form of the update equations derived below only requires the inver- 
sion of k x k matrices: a very desirable property since k is usually much smaller 
than n. Referring to Algorithm 5.1, since D t is invertible, apply Corollary B.2 
to find 

(H t Hj + D t y 1 H t = Dt l H t (I + HjD^H t )-\ 
The alternative expression for R t is 

R t = (I + HjD^Ht)- 1 + {I+HjDt 1 H t )- 1 HjDr 1 %Dr 1 H t (I + H?Dt 1 )-\ 
The update formula (5.5) can therefore be replaced with 

H t+1 = ED^H t (I + HjD^H^R- 1 ' 2 . 
Similar results can be derived also for Algorithm 5.2. 

5.3. Asymptotic properties 

In the portmanteau proposition below we collect the asymptotic properties of 
Algorithm 5.1, also quantifying the I-divergence decrease at each step. 

Proposition 5.4. For Algorithm 5.1 the following hold. 

(a) H t Hj < S for all t > 1. 

(b) If D > and A(£ - D ) > then D t > for all t > 1. 

(c) The matrices R t are invertible for all t > 1. 

(d) If H t Hj +D t =t then H t+1 = H t , D t+1 = D t . 

(e) Decrease of the objective function: 

I(E||E t )-J(E||E t+ i)=2:(E M+ i||E M )+2:(Eo,t||So,t+i), 

where E t = H t Hj + D t is the t-th approximation o/E, and Egt, Ei _ t were 
defined in subsection 5.1. 

(f) The interior limit points (H 7 D) of the algorithm satisfy 

if = (E — HH T )D~ 1 H, D = A(E — HH T ), (5.10) 

which are the ML equations (3.4) and (3.5). If (H, D) is a solution to these 
equation also (HU,D) is a solution, for any unitary matrix U £ M. kxk . 

(g) Limit points (%,D), see (5.9), satisfy 

H = Y,(H + D)- 1 H, D = A(E - H). 
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Proof, (a) This follows from Remark 4.8 and the construction of the algorithm 
as a combination of the two partial minimizations. 

(b) This similarly follows from Remark 4.8. 

(c) Use the identity / - Hj {H t Hj + D t )~ 1 H t = (I + Hj D^ 1 Ht)- 1 and £ 
nonnegative definite. 

(d) In this case, Equation (5.1) shows that R t = I and substituting this into 
the update equations yields the conclusion. 

(e) As matter of fact, we can express the decrease as a sum of two I-divergences, 
since the algorithm is the superposition of the two partial minimization prob- 
lems. The results follows from a concatenation of Proposition 4.4 and Proposi- 
tion 4.7. 

(f) We consider Algorithm 5.2 first. Assume that all variables converge. Then, 
from (5.7), for limit points L,P,D it holds that 

L = %{LPL T + Z)) _1 L, 

which coincides with equation (3.4). Let then Q be a square root of P and 
H = LQ T . This gives the first of the desired relations. The rest is trivial. 

(g) This follows by inserting the result of (f). □ 

In part (f) of Proposition 5.4 we have made the assumption that the limit points 
are interior points. This assumption does not always hold true, it may happen 
that a limit point (H, D) is such that D contains zeros on the diagonal. We will 
treat this extensively in Section 7.1 in connection with a restricted optimization 
problem, in which it is imposed that D has a number of zeros on the diagonal. 

6. Comparison with the EM algorithm 

Rubin and Thayer (1982) put forward a version of the EM algorithm (see Demp- 
ster, Laird and Rubin (1977)) in the context of estimation for FA models. Their 
algorithm is as follows. 

Algorithm 6.1 (EM). 

H t+1 = t{H t Hj + D^HtRi 1 (6.1) 
A+i = A(S - H t+1 R t Hj +1 ), (6.2) 

where Rt = I - Hj {H t Hj + DJ-^HtH? + D t - t)(H t Hj + D t )~ 1 H t . 

The EM Algorithm 6.1 differs in both equations from our Algorithm 5.1. It is 
well known that EM algorithms can be derived as alternating minimizations, 
see Csiszar and Tusnady (1984), it is therefore interesting to investigate how 
Algorithm 6.1 can be derived within our framework. Thereto one considers the 
first partial minimization problem together with the constrained second partial 
minimization Problem 4.10, the constraint being Q = Qo, for some Qq. Later on 
we will see that the particular choice of Qq, as long as it is invertible, is irrelevant. 
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The concatenation of these two problems results in the EM Algorithm 6.1, as is 
detailed below. 

Starting at (H t , D t ,Qo), one performs the first partial minimization, that 
results in the matrix 

S V(H t H t + D t )- l H t Q \ 

QlHj{H t H t +D t )- l t QjRtQo )' 

Performing now the constrained second minimization, according to the results 
of Proposition 4.11, one obtains 

H t+1 = E(H t Hj + Dt^HtRi 1 (6.3) 
A+i = A(S - t{H t Hj + D t )~ 1 H t R^ 1 Hj (H t Hj + A) _1 £). (6.4) 

Substitution of (6.3) into (6.4) yields 

A+i = A(S-ff f+1 i? f ff t T +1 ). 

One sees that the matrix Qq does not appear in the recursion, just as the 
matrices Qt do not occur in Algorithm 5.1. 

Both Algorithms 5.1 and 6.1 are the result of two partial minimization prob- 
lems. The latter algorithm differs from ours in that the second partial mini- 
mization is constrained. It is therefore reasonable to expect that, from the point 
of view of minimizing I-divergence, Algorithm 5.1 yields a better performance, 
although comparisons must take into account that the initial parameters for the 
two species of the second partial minimization will in general be different. We 
will illustrate these considerations by some numerical examples in Section 8. 

We also note that for Algorithm 5.1 it was possible to identify the update 
gain at each step, see Proposition 5.4(e), resulting from the two Pythagorean 
rules. For the EM algorithm a similar formula cannot be given, because for the 
constrained second partial minimization a Pythagorean rule does not hold, see 
the discussion after Proposition 4.11 in Section 4.1. 

7. Singular D 

It has been known for a long time, see e.g. Jbrcskog (1967), that numerical 
solutions to the ML equations (see Section 3) often produce a nearly singular 
matrix D. This motivates the investigation of the stationary points (H,D) of 
Algorithm 5.1 with singular D, i.e. with zeros on the diagonal (Section 7.1). 
Naturally connected to this is the analysis of the minimization Problem 3.1 
when D is constrained, at the outset, to be singular (Section 7.2), and the inves- 
tigation of its consequences for the minimization algorithm of Proposition 5.3 
(Section 7.3). 
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7.1. Stationary points (if, D) with singular D 

As mentioned before, already in Joreskog (1967) it has been observed that, 
numerically maximizing the likelihood, one often reaches matrices D that are 
nearly singular. This motivates the investigation of the stationary points (H, D) 
of Algorithm 5.1 for which D is singular, i.e. 



D ={0 D 2 )-{0 Oj' (") 

where D\ > has size rt\ x ri\ and the lower right zero block has size n 2 x n 2 , 
with ni + n 2 — n. 

Accordingly we partition H £ W ixk as 

*-(!;)■ <"» 

where H x e M" lXfc and H 2 G M" 2Xfe . Then 

We recall that Problem 3.1 calls for the minimization, over H and D, of the 
functional I(Yj\\HH t + D), which is finite if and only if HH T + D is strictly 
positive definite. In view of (7.3), this happens if and only if 

H 2 Hj > 0, 

the standing assumption of this section. A direct consequence of this assumption 
is that n 2 < k. 

The given matrix S will be similarly decomposed as 

S=fg 11 (7.4) 



S 2 i £ 



22- 



Proposition 7.1. // (H,D) is a stationary point of the algorithm, with D as 
in (7.1), then the given matrix E is such that £22 = H 2 Hj and £12 = HiHj . 

Proof. By Proposition 5.4 £ — HH T is nonnegativc definite, as is its lower 
right block £ 22 - H 2 Hj . Since D = A(£ - HH T ) and D 2 = 0, we get that 
A(£ 22 - H 2 Hj) = and therefore £22 = H 2 Hj . We conclude that 

g _ hh t = / En " HxHj £ 12 - H,hA > 

^£21 - # 2 #i T y - ' 

hence £ 12 = Hi7? 2 T - D 
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Define 

Hi := Ht(I - Hji^Hjy 1 ^). (7.5) 
Since / — Hj (H 2 Hj )~ l H 2 is a projection, one finds 

HiHj = H X (I- Hj{H 2 H^)- x H 2 )Hj. (7.6) 

In view of Proposition 7.1 this becomes 

H\Hj = HiHj — Si 2 S 22 1 S 2 i- (7.7) 

Finally we need 

£11 :=Sn — Si2S 2 2 1 ^2i- (7-8) 

Proposition 7.2. // (H, D) is a stationary point of the algorithm with D2 = 0, 
then 

l{t\\HH T + D) = IpuWHiHj + Dx). 
Moreover, the stationary equations (5.10) reduce to 

H x = Vxx{HxHj + Dx^Hx = (E u - HxHj)D^Hx 
Dx=A(Ilxi-HxHj). 

Proof. One easily verifies that for any nonsingular matrix A of the appropriate 
size 

1(APA T \\AQA T ) =1(P\\Q). 



Taking 

,0 / 



one finds 



AT,A T 



En 

E 22 , 



Moreover, by Proposition 7.1, 

A(HH T + D)A T = i^ xHl +Dl Q E i2S 22 1 E 2 i ^0 ^ 

where the upper left block is equal to HxHj + Dx in view of equation (7.7). 
The first assertion follows. The reduced stationary equations follow by simple 
computation. □ 

Remark 7.3. Under the conditions of Proposition 7.2, the pair (Hi, Dx) is also 
a stationary point for the minimization of X(Yjxx\\H x HJ + Dx). This is in full 
agreement with the results of Section 7.2. 
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7.2. Approximation with singular D 

In this section we consider the approximation Problem 3.1 under the con- 
straint Z?2 = 0. Joreskog (1967) investigated the solution of the likelihood equa- 
tions (3.5) and (3.6) under zero constraints on D, whereas in this section we 
work directly on the objective function of Problem 3.1 without referring to those 
equations. The constrained minimization problem can be formulated as 

Problem 7.4. Given E > of size nxn and integers n2 and k, with n2 < k < n, 
minimize 

1{%\HH T +D), (7.9) 

over (H,D) with D satisfying (7.1). 

We will now decompose the objective function, choosing a convenient repre- 
sentation of the matrix H, in order to reduce the complexity of Problem 7.4. To 
that end we make the following observation. Given any orthogonal matrix Q, 
define H' = HQ, then clearly H' H ,T + D = HH T + D. Let H 2 = U(0 A)V T 
be the singular value decomposition of H 2 , with A a positive definite diagonal 
matrix of size n-i x n2, and U and V orthogonal of sizes n2 x n2 and k x k 
respectively. Let 

H' = HV 

The blocks of H' arc H[ = H X V and H' 2 = {H' 2l H' 22 ) := (0 UA), with 
H' 2l e R(fc-»2)xn 2 and H ^ e K n 2 xn 2 _ Hence, without loss of generality, in 
the remainder of this section we assume that 

H ={hI) = { H 1 £)< Avertible. (7.10) 

Finally, let 

K = T.i 2 T. 22 - HiHJ(H 2 hJ) , 
which, under (7.10), is equivalent to 

K = E12S22 1 ~ Hi 2 H 22 . 
Here is the announced decomposition of the objective function. 

Lemma 7.5. Let D be as in equation (7.1). The following I-divergence decom- 
position holds. 

l(E\\HH T + D)=l& 11 \\H 11 Hj 1 + Dx) +T&22WH21HI2) 

+ hr(% 2 K T (H 11 Hj 1 +D 1 )- 1 K). (7.11) 

Proof. See Appendix D. □ 
We are now ready to characterize the solution of Problem 7.4. 
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Proposition 7.6. Any pair (H,D), as in (7.1) and (7.10), solving Problem 7.4 
satisfies 

• I(En||JJxi/fj^ + D\) is minimized, 

• H22HJ2 = ^22 5 

• H12 = Ei2S 2 2 1 ^22- 

Proof. Observe first that the second and third terms on the right hand side 
of (7.11) are nonnegative and can be made zero. To this end it is enough to 
select H 2 2 such that H22HJ2 — ^22 and then iJ 12 = Ei 2 E 22 1 i/22- The remaining 
blocks, Hu and D\, are determined minimizing the first term. □ 

Remark 7.7. In the special case fi2 = k, the matrices Hu and if 21 are empty, 
H12 = Hi, and H22 — Hi. From Proposition 7.6, at the minimum, H2HJ = E22, 
H x Hj = S12, and D 1 minimizes I(Sn||Di). The latter problem has solution 
D\ = A(Sn). It is remarkable that in this case the minimization problem has 
an explicit solution. 

Proposition 7.6 also sheds some light on the unconstrained Problem 3.1. 

Corollary 7.8. Assume that, in Problem 3.1, I(Y,\\HH T + D) is minimized 
for a pair (H, D) with D of the form (7.1). Then S12 = H\Hj , E 2 2 = H2HJ , 
and {H\,Di), where H\ is as in (7.5), minimizes X(Yiu\\HiHj +D%). 

Proof. It is obvious that, in this case, Problem 3.1 and Problem 7.4 are equiv- 
alent. The result readily follows from Proposition 7.6, in view of the equality 
Si = {H n 0). □ 

Hence, in Problem 3.1, a singular minimizer D can occur only if S has the 
special structure described in Corollary 7.8. 

In the same spirit one can characterize the covariances E, admitting an exact 
FA model of size k > n%, and with Z) 2 = 0. This happens if and only if, with 
the notations of Section 7.1, 

En — Ei2S 2 2 1 ^2i is a diagonal matrix. (7-12) 

This condition is easily interpreted in terms of random vectors. Let Y be an 
n dimensional, zero mean, normal vector with Cov(y) = X and partition it 
into two subvectors (Yy, Y 2 ), of respective sizes m and U2, corresponding to the 
block partitioning of E. The above condition states that the components of Y\ 
are conditionally independent given Y 2 . The construction of the fc-dimensional, 
exact FA model, with D 2 — is as follows. 

Let D\ = En — Ei2E 22 1 E2i, which is a diagonal matrix by assumption. Let 
R be the symmetric, invertible, square root of E 22 . Define the matrices 



Hi = E12 (R- 1 0) E R" 1 
H 2 = (R 0) e R n2Xk . 
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One verifies the identities H 2 H 2 = T, 22 , HiHj = E 12 and HiHj = T,i 2 Y, 22 T, 2 i. 
It follows that En = D\ + HiHj . Let Z be a (k — Tridimensional random 
vector, independent of Y, with zero mean and identity covariance matrix. Put 

*-(*?)■ 

Then Cov(X) = 7^. Furthermore, £1 := Y\ — H\X is independent of X with 
Cov(ei) = Di, and y 2 - H 2 X = 0. It follows that 

Yi = FiX + £i 
F 2 = i?2^ 

is an exact realization of Y" in terms of a factor model. 



7.3. Algorithm when a part of D has zero diagonal 

In Section 7.2 we have posed the minimization problem under the additional 
constraint that the matrix D contains a number of zeros on the diagonal. In 
the present section we investigate how this constraint affects the alternating 
minimization algorithm. For simplicity we give a detailed account of this, only 
using the recursion (5.8) for T-i t . Initialize the algorithm at (Hq,Dq) with 

- (f °) , (7,3, 

where D > is of size rt\ x n\ and 




(7.14) 



where H2 € R™ 2X/c is assumed to have full row rank, so that n 2 < k (note the 
slight ambiguity in the notation for the blocks of H ). Clearly H Hj + D is 
invertible. For H as in equation (7.14) put 

'H = H l {I- Hj{H 2 Hjy l H 2 )Hj . (7.15) 

We have the following result. 

Lemma 7.9. Let (Hq,Do) be given as above, and Ho = HqHq . Applying one 
step of recursion (5.8), one gets 




(7.16) 



where 

n 11 = %i{U + D)~ X, H(D + T,xi{U + D)- 1 ?!)- 1 ^!! + SiaS^Sai. (7.17) 
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and 

Dl= (A(2n-*> °). (7.18) 

Proof. We start from Equation (5.8) with t = and compute the value of Hi- 
To that end we first obtain under the present assumption an expression for the 
matrix (H + Dq^H. Let P = I - Hj(H 2 Hjy 1 H 2 . It holds that 



as one can easily verify by multiplying this equation by % + -Do- We also need 
the inverse of Dq + H(H + Dq)^ 1 !!, postmultiplied with S. Introduce U — 
D + T, lx {HiPHj + D)- l HxPHj and 

V = %£%x{HxPHj + D)- 1 + {H 2 Hj)- l H 2 Hj{H 2 Hj)- l D. 
It results that 



Insertion of the expressions (7.19) and (7.20) into (5.8) yields the result. □ 

The update equations of the algorithm for H t and D t , can be readily derived 
from Lemma 7.9 and are summarized below. 

Proposition 7.10. The upper left block Ti^ 1 ofH t , can be computed running a 
recursion for 7i t := T-L^ 1 — E^E^X^i, 

Ut+i = %i(Ht + D^Utibt + Eu(n t + D t )- 1 Ht)- 1 Zu, 

whereas the blocks on the border of Ht remain constant. The iterates for D t all 
have a lower right block of zeros, while the upper left n\ x n\ block D t satisfies 

D t = A(En - % t ). 

□ 

Note that the recursions of Proposition 7.10 are exactly those that follow from 
the optimization Problem 7.4. Comparison with (5.8), shows that, while the 
algorithm for the unconstrained case updates %t of size n x n, now one needs 
to update Ht which is of smaller size n% x n\. 



Now we specialize the above to the case in which n 2 — k. 
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Corollary 7.11. Let the initial value D be as in Equation (7.13) with n 2 = k. 
Then for any initial value Jio the algorithm converges in one step and one has 
that the first iterates D\ and "Hi, which are equal to the terminal values, are 
given by 



Proof. We use Proposition 7.9 and notice that in the present case the matrix % 
of (7.15) is equal to zero. Therefore H 11 = Ei2E^ 2 1 £2i and the result follows. □ 

It is remarkable that in this case we have convergence of the iterates in one 
step only. Moreover the resulting values are exactly the theoretical ones, which 
we have explicitly computed in Remark 7.7. 

8. Numerical examples 
8.1. Simulated data 

In the present section we investigate the performance of the Algorithm 5.1 and 
compare it to the behaviour of the EM Algorithm 6.1. In all examples we take 
E equal to AA T + cdiag(d), where A e R nxrn with m < n, d e M™ and 
c > 0, for various values of n, m. The matrices A and the vector d have been 
randomly generated. The notation A=rand(n, to) means that A is a randomly 
generated matrix of size n x to, whose elements are independently drawn from 
the uniform distribution on [0, 1]. In all cases the resulting matrix E is strictly 
positive definite. The reason for incorporating the component d is that we want 
to check whether the algorithm is able to reconstruct E = AA T + diag(ci) in 
case the inner size k of the matrices H t produced by the algorithm is taken to 
be equal to m. 

We have also included results on the L 2 -norm of the difference between the 
given matrix E and its approximants E t = H t Hj + D t , i.e. we also compute 

_ , 1/2 

it = (tr((E — E f ) T (E — E t ))) . The origin of this extra means of comparison 
of behavior of the Algorithms 5.1 and 6.1 is that we detected in a number of 
cases that in terms of the value of the divergences, the difference between the 
approximations generated by the two algorithms was, after enough iterations, 
negligible, whereas a simple look at the matrices produced by the final iterations 
revealed that Algorithm 5.1 produced very acceptable, if not outstanding results, 
whereas the approximations generated by the EM algorithm 6.1 for the same 
given matrix were rather poor. This phenomenon is reflected by a huge L 2 -error 
of the EM algorithm, as compared to a small one of Algorithm 5.1. The choice 
for the L 2 -norm is to some extent arbitrary. We are basically concerned with 
good approximations in terms of I-divergence, and it is therefore a priori not 
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Fig 1. A=rand(10,5) , d = 2*rand(10,l) , k=2 



completely fair to judge the quality of approximations by switching to another 
criterion. However, the L 2 -norm of the error has an intuitive interpretation, is 
easy to compute and also has some appealing properties in the context of the 
two partial minimization problems, cf. Remarks 4.5 and 4.9. 

We have plotted various characteristics of the algorithms against the number 
of iterations, for both of them the divergence at each iteration, as well as their 
counterparts for the L 2 -norm (dashed lines). For reasons of clarity, in all figures 
we have displayed the characteristics on a logarithmic scale. 




Legenda 

solid blue: divergence X(S||£t) in algorithm 5.1 

solid red: divergence in the EM algorithm 6.1 

dashed blue: L 2 -norm of £ — £ t in algorithm 5.1 

dashed red: L 2 -norm in the EM algorithm 6.1 



The numerical results have been obtained by running Matlab. 

Figures 1 and 2 show the behaviour of the two algorithms in cases with 
n = 10 (which is relatively small) and for k — 2,5 respectively. We observe 
that the performance of the algorithms hardly differ, especially for k = 2. In 
Figures 3 and 4, we have n = 30 and k — 5, 15 respectively. We notice that 
in terms of divergence, the performance of the two algorithms is roughly the 
same for k = 5, but for k = 15 there are noticeable differences. But looking 
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Fig 3. J 4=rand(30,15) , d = 3*rand(30,l) , fc=5 
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Fig 5. .4=rand(50,30) , d = 5*rand(50,l) , fc=10 
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Fig 6. yl=rand(50,30) , d = 5*rand(50,l) , fc=30 

at the L 2 -norm of the error, we even see a manifest difference of the outcomes. 
The differences are even more pronounced in Figures 5 and 6, where n = 50 
and k = 10, 30 respectively. In the former case, in terms of divergences, the 
two algorithms behave roughly the same, but there is a factor 10 of difference 
in the L 2 -erros. In the latter case, where k — m one would expect that both 
algorithms are able to retrieve the original matrix S, which seems to be the 
case, although Algorithm 5.1 behaves the best. Looking at the L 2 -error, we see 
a gross difference between the Algorithm 5.1 and the EM algorithm of order 
about 100. This striking difference in behaviour between the two algorithms is 
typical. 

8.2. Real data example 

In the present section we test our algorithm on the data provided in the orig- 
inal paper Rubin and Thayer (1982), where the EM algorithm for FA models 
has been presented first. The results, with in this case £ the empirical corre- 
lation matrix of the data, are presented in Figure 7. We observe that again 
Algorithm 5.1 outperforms the EM algorithm. The underlying numerical results 
are at first sight very close to those of Rubin and Thayer (1982) (we have also 
taken k = 4), but we observe like in the previous section that the convergence 
of the EM algorithm is much slower than that of Algorithm 5.1 and after 50 
iterations (the same number as in Rubin and Thayer (1982)) the differences are 
quite substantial. 



div 

divEM 

L2 

- L2EM 





Fig 7. Rubin-Thayer data, k=4 



Appendix A: Multivariate normal distribution 

Let (X T ,Y T ) T be a zero mean normally distributed random vector with co- 
variance matrix 

y, _ f^xx ^xy\ 

\T,yx Eyy J 

Assume that Eyy is invertible, then the conditional distribution of X given Y 
is normal, with mean vector E [X|Y~] = ExyE^y Y" anc ^ covariance matrix 

Cov[X\Y] = Yl xx - E X yE yy Ey X . (A.l) 

Consider two normal distributions V\ = iV(/Xi,Ei) and V2 = N (112,^2) on a 
common Euclidean space. The I-divergence is easily computed as 

^1 Ik) = \ log - I + ^(E^Ei) + - M2) T E 2 - 1 (/ii - 

- I(Ei||E a ) + - M2) T E 2 - 1 (/xi - m), (A.2) 

where Z(Ei||E2) denotes as before the I-divergence between positive definite 
matrices. The extra term, depending on the nonzero means, did not appear 
in (3.2). 

Appendix B: Matrix identities 

For ease of reference we collect here some well known identities from matrix 
algebra. 











/A — CD~ 1 B 




3 






\ o 
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The following lemma is verified by a straightforward computation. 

Lemma B.l. Let A,B,C,D be blocks of compatible sizes of a given matrix, 
with A and D both square. If D is invertible the following decomposition holds 

Q\( I 
D) \p- 1 B I 

while, if A is invertible, the following decomposition holds 

A C\ f I 0\ [A \ (I A- l C 

B D)~\BA- 1 D-BA- l c)\Q I 

Furthermore, assuming that A, D, and A — CD~ l B are all invertible, we have 

A C^ L 
B D 

{A-CD^B)- 1 -(A - CD- 1 B)- 1 CD- 1 

-D- X B(A - CD- X B)- X D- 1 + D- l B{A - CD^B^CD' 1 

Corollary B.2. Let A,B,C,D matrices as in Lemma B.l with A, D, and 
A — CD- 1 B all invertible. Then D — BAC is also invertible, with 

(D - BAG)- 1 = D- 1 + D- 1 B{A- 1 - CD~ 1 B)- 1 CD- 1 . 

Proof. Use the two decompositions of Lemma B.l with A replaced by A~ l and 
compute the two expressions of the lower right block of the inverse matrix. □ 

Corollary B.3. Let B e R nxm andC £ W nxn . Then det(J„ -BC) = det(J ro - 
CB) and I n — BC is invertible if and only if I rn — CB is invertible. 

Proof. Use the two decompositions of Lemma B.l with A = I m and D = I n to 
compute the determinant of the block matrix. □ 

Corollary B.4. Let D be a positive definite matrix, not necessarily strictly 
positive definite. If HH T +D is strictly positive definite then also I—H T (HH T + 
D)- 1 H is strictly positive. 

Proof. Use Lemma B.l with A = I,B = H,C = H T and D replaced with 
HH T + D. The two middle matrices in the decompositions are respectively 

/ — H T (HH T + D)- 1 H 

HH T + D 

and 

f I 
D 

Hence, from the second decomposition it follows from positive dcfmitcncss of D 
( I H T \ 

that I „ TT rrT „ ) is positive definite, and then from the first decomposi- 
\H Hn + D J 

tion that I - H T {HH T + D^H is positive definite. □ 



L. Finesso, P. Sprei] '/ 'approximate factor analysis 28 

Appendix C: Decompositions of the I-divergence 

We derive here a number of decomposition results for the I-divergence between 
two probability measures. Similar results are derived in Cramer (2000), see 
also Finesso and Spreij (2006) for the discrete case. These decompositions yield 
the core arguments for the proofs of the propositions in Sections 4.1 and 7.2. 

Lemma C.l. Let Vxy and Qxy be given probability distributions of a Eu- 
clidean random vector (X, Y) and denote by Px|y and Qx\y the corresponding 
regular conditional distributions of X given Y. Assume that¥xY *C Qxy- Then 

I(¥xy\\Qxy) =1(V Y \\Q Y ) + E Py 1(V XI y\\Qx\y)- (C.l) 

Proof. It is easy to see that we also have Vy <C Qy- Moreover we also have 
absolute continuity of the conditional laws, in the sense that if is a version 
of the conditional probability Q(X G B\Y), then it is also a version of P(X £ 
B\Y). One can show that a conditional version of the Radon-Nikodym theorem 

applies and that a conditional Radon-Nikodym derivative d Q*'^ exists Qy- 
almost surely. Moreover, one has the Qxy-a.s. factorization 

dPxY _ dPx|Y dF Y 
dQxY ~ dQ X \Y dQ7' 

Taking logarithms on both sides and expectation under Fxy yields 

, dP XY ^ , dPxiy m , dPy 

Ep - log dQ*7 = Ep - log dQ^ + log dQy- 

Writing the first term on the right hand side as Ep Y1 ,{Ep Y1 , [log gjj^y^l^l}, wc 
obtain E Py {E Px]Y [log \y]}. The result follows. □ 

The decomposition of Lemma C.l is useful when solving I-divergence mini- 
mization problems with marginal constraints, like the one considered below. 

Proposition C.2. Let Qxy and P Y be given probability distributions of a Eu- 
clidean random vector (X,Y), and of its subvectorY respectively. Consider the 
I-divergence minimization problem 

min 1(V XY \\Qxy), 



where 



V:={¥ X y\ f V X Y(dx,Y) =F° Y }. 



If the marginal ¥ Y <C Qy, then the I-divergence is minimized by P* X y specified 
by the Radon-Nikodym derivative 

dP*XY = dP° y (C2) 
dQxY dQy' { • ' 
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Moreover the Pythagorean rule holds i.e., for any other distribution P € V , 

2(V XY \\Q XY ) =2(V XY \\V XY ) +2(V XY \\Q XY ), (C.3) 
and one also has 

I(V XY \\Q XY )=1(V Y \\Q Y ). (C.4) 
Proof. The starting point is equation (C.l), which now takes the form 

i(v XY \\q XY ) =i(¥° Y \\q Y ) + e Py i(v xiy \\q xiy ). (c.5) 

Since the first term on the right hand side is fixed, the minimizing V* XY must 
satisfy ¥* X , Y = Q X \ Y . It follows that V* XY = V* xlY V Y = Q X \ Y V Y , thus verify- 
ing (C.2) and (C.4). We finally show that (C.3) holds. 

HIP HIP* 

i(v XY \\q XY ) = e Pxy log — ^ + E fxY log -pi 

dP° 

= 1(V xy \\V xy )+E Py log^ 

d¥ Y 



= 1(V XY \\V XY )+E rl , log 

where we used that any V XY £ V has F-marginal distribution F Y . □ 

The results above can be extended to the case where the random vector 
(X, Y) := (X,Yi, . . .Y m ), i.e. Y consists of m random subvectors Yi. For any 
probability distribution ¥ XY on (X, Y), consider the conditional distributions 
Py.|x and define the probability distribution F XY on (X, Y): 

Pxy =\[¥ Y \ X ¥ X . 

i 

Note that, under V XY , the Yi are conditionally independent given X. The fol- 
lowing lemma sharpens Lemma C.l. 

Lemma C.3. Let V XY and Q XY be given probability distributions of a Eu- 
clidean random vector (X, Y) :— (X, Yi, . . . Y m ). Assume that V XY <C Qxy and 
that, under Q XY , the subvectors Yi ofY are conditionally independent given X, 
then 

T(Pxy WQxy) = AVxy I |Pxk) + J2 ^v x T(V Y {x \ \Q Y{X ) + 2(V x \\Qx)- 

i 

Proof. The proof runs along the same lines as the proof of Lemma C.l. We start 
from equation (C.l) with the roles of X and Y reversed. With the aid of Vxy 
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one can decompose the term E Px Z(Py|x||Qy|x) as follows. 
E Px l(¥ Ylx \\Q Ylx ) = E Px E Pyix log -— L_ 

( dP Ylx dPytX \ 

= E Px E Vyix log -J^- + log — ^ 

\ dPy| X 0-1Jy\X J 

= E Px 1(¥ y \ x \\Py\x) + Ep x E P ^ |X lo S 

= I(Pxy||P X y) + J> Px E P log 
= Z(P X y||Pxy)+^E Px Z(P 



where we used the fact that d ^ XY = ^ y|x . This proves the lemma. □ 

dP x y dPy i x 

The decomposition of Lemma C.3 is useful when solving I-divergence mini- 
mization problems with conditional independence constraints, like the one con- 
sidered below. 

Proposition C.4. Let ¥xy be a given probability distribution of a Euclidean 
random vector (X,Y) := {X,Y\, . . .Y m ). Consider the I-divergence minimiza- 
tion problem 

min 2{Vxy\\Qxy), 

where 

Q ■= {Qxy I Q Yl ,....Y m \x = liQ Yi \x}- 

i 

If^XY <^ Qxy f or some Qxy G Q then the I-divergence is minimized by 

Q*XY = ^XY 

Moreover, the Pythagorean rule holds, i.e. for any Qxy G Q, 

1(V xy \\Qxy) =1(V X y\\Qxy) +AQ*xy\\®xy)- 

Proof. From the right hand side of the identity in Lemma C.3 we see that the 
first I-divergence is not involved in the minimization, whereas the other two can 
be made equal to zero, by selecting Q Yi \x = ^Yi\x an d Qx = Px- This shows 
that the minimizing Q* X y ^ s equal to Vxy- 
To prove the Pythagorean rule, we first observe that trivially 



"xyMxy) 



:1(F X y\Vxy)- (C.6) 
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Next we apply the identity in Lemma C.3 with Q* X y replacing Vxy- ln this case 
the corresponding Q* X y obviously equals Q* X y itself. Hence the identity reads 

I(Q*xy\\Qxy) =Yl E ^ x I (QY i \xW^\x) +l(Q*x\\®x) 

i 

= J2^xZ(Vy\x\\Qy\x)+Z(Vx\\Qx), (C.7) 

i 

by definition of Q^y- Adding up equations (C.6) and (C.7) gives the result. □ 
Appendix D: Proof of the technical results 

PROOF of Proposition 3.2. Existence of the minimum. Let (H ,D ) be arbi- 
trary. Perform one iteration of the algorithm to get (Hi,Di) with X{T, \ \Hi Hj + 
£>i) < 1{%\H H^ + D Q ). Moreover, from Proposition 5.4, H 1 Hj < E and 
Di < A(E). Hence the search for a minimum can be restricted to the set 
of matrices (H,D) satisfying HH T < E and D < A(E). We claim that the 
search for a minimum can be further restricted to the set of (H, D) such that 
HH T + D > el for some sufficiently small e > 0. Indeed, if the last inequal- 
ity is violated, then HH T + D has at least one eigenvalue less than e. As- 
sume this is the case^ write HH T + D = UAU T , the Jordan decomposition of 
HH T + D, and let E = U^ V U T . Then l(E\\HH T + D) = | |A), as one 

easily verifies. Denoting by A, the eigenvalues of HH T + D, X io the smallest 
among them, and by an the diagonal elements of Xj/, we have that I(E[/|A) = 

\ U°g + ^) — 1 1°S \^u\ — f ■ Choose e smaller than c := min, an > 0, 

since E > 0. Then the contribution of i = io in the summation is larger than 
loge + | which tends to infinity for e — > 0. Hence the claim is verified. This 
shows that a minimizing pair (H,D) has to satisfy HH T < E, D < A(E), 
and HH T + D > el, for some e > 0. In other words we have to minimize the 
I-divergence over a compact set on which it is clearly continuous. This proves 
Proposition 3.2. □ 

PROOF of Proposition 4.2. Relation between the original and the lifted 
problem. Let Ei = Z{H,D,Q). With E* = S*(Ei), the optimal solution of the 
partial minimization over E , we have for any E G E , using (4.3) in the first 
equality below, 

I(E ||E 1 )>Z(E*||E 1 ) 

= 1(%\\HH T +D) 
> inf X(Z\\HH T +D). 

~ H,D 

It follows that inf Eog SigEi ^(^o||Ei) > min^ £>I(E||i7i7 T + D), since the 
minimum exists in view of Proposition 3.2. 
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Conversely, let (H*,D*) be the minimizer of (H,D) i-> X(T,\\HH T + D), which 
exists by Proposition 3.2, and let E* = Y,(H* , D* ,Q*) be a corresponding ele- 
ment in Si. Furthermore, let E** G S be the minimizer of E i-> ^(So||S*) 
over So- Then we have 

1(%\H*H* T +D*) =Z(E**||S*) 

> inf J(E ||Ei), 

which shows the other inequality. Finally, we prove that the the infimum can be 
replaced with a minimum. Thereto we will explicitly construct a minimizer in 
terms of {H*,D*). For any invertible Q* let E* = E(H*, D* , Q*). Performing 
the first partial minimization, we obtain an optimal E** € So, with the property 
(see (4.3)) that Z(E**|E*) =X(T.\\H*H* r + D*). □ 

Proof of Proposition 4.4. First partial minimization. Consider the setup 
and the notation of Proposition C.2. Identify Q with the normal N(0, E), and P 
with N(0, Eo). By virtue of (C.2), the optimal P* is a zero mean normal whose 
covariance matrix can be computed using the properties of conditional normal 
distributions (see appendix A). In particular 

E^ = E P ,XY T = E r , (E P * [X\Y]Y T ) 
= E P ,(E Q [X\Y}Y T ) 
= E P .(E 21 E u 1 yr T ) 
= EaiE^EpoYF 7 
= E2iE 11 1 E. 

Likewise 

E; 2 =E r ,XX T = Cov r ,(X\Y) +E P ,(E P ,[X\Y}E P ,[X\Y} T ) 
= Cov Q {X\Y) +E P ,(E Q [X\Y]E (1 [X\Y} T ) 
= E 22 - E 21 E^E 12 +E P ,(E 21 E n 1 r(E 21 E n 1 r) T ) 
= E 22 — E 2 iE 11 Ei 2 + Epo (E 2 iE 11 YY E n Ei 2 ) 
= E 22 — E 2 iE 11 1 Ei 2 + E 2 iE 11 1 SE 11 1 Ei 2 . 

To prove that E* is strictly positive note first that E^ = E > by assumption. 
To conclude, since E > 0, it is enough to note that 

^22 — ^2l(^ll) ^12 = ^22 — S 21 E 11 1 E 12 



Finally, the relation X(E*||E) = Z(E||En) is Equation (C.4) adapted to the 
present situation. The Pythagorean rule follows from this relation and Equa- 
tion (C.5). □ 
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Proof of Proposition 4.7. Second partial minimization. We adhere to the 
setting and the notation of Proposition C.4. Identify P = Fxy with the normal 
distribution N(0, E) and Q = Q X y with the normal N(0, E x ), where Ei 6 Si. 
The optimal Q* = Qxy is again normal and specified by its (conditional) mean 
and covariance matrix. Since Qy-.pf = ^Yi\x for all i, we have Eq»[Y|X] = 
E P [y|X] = E^Ej/X, moreover Q* x = P X - Hence we find 

E 12 = E Q .yi T = Eq.Eq- [Y\X]X t = E r Ep[Y\X]X T = Ei 2 . 

Furthermore, under Q*, the Yi are conditionally independent given X. Hence 
Cav q ,{Y u Y j \X) = 0, for i ^ j, whereas Var Q .(li|X) = Var P (y i |X), which is 
the ii-element of (En — E12E22 S21), it follows that 

Cov Q ,(F|X) = A(Eu - EiaE^Esi). 

We can now evaluate 

E*j = CovQ»(y) = Eq*YY t 

= Eq* (E q, [Y \X]E [Y\X] t + Cov Q , (Y|X)) 

= Eq* (E 1 2E 22 1 XX T E 22 1 E2i + A(E X1 — E 12 E 22 1 E 21 )) 

— E 12 E 22 1 E 21 + A(E n — E 12 E 22 1 E 21 ). 

The Pythagorean rule follows from the general result of Proposition C.4. □ 

Proof of Proposition 4.11. Constrained second partial minimization. 
Lemma C.3 and Proposition C.4 still apply, with the proviso that the marginal 
distribution of X is fixed at some Q x . The optimal distribution Q* XY will there- 
fore take the form Q* XY = Ili ^YAxQx- Turning to the explicit computation of 
the optimal normal law, inspection of the proof of Proposition 4.7 reveals that 
under Q* we have E^YX T = E^E^Po and 

CovQ*(y) = A(En — Ei 2 E 22 1 E 2 i) + Ei 2 E 22 1 PoE 22 1 E 2 i. 

□ 



Proof of Lemma 7.5. Technical decomposition of the I-divergence. Recall 
the following notation. 

H = (£) (m) 

»=d ^, ( D ,) 



E = (5 11 5 12 ), (D.3) 



E21 E22 

J 22 



En —En — Ei2E 22 1 E 2 i, (D.4) 
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where H x e R niXk , H 2 G R" 2Xfe , D x G R«i x ™i and D 2 G R™ 2X ™ 2 . 

Define 5 = Hi(I - Hj (H 2 Hj + D 2 )- 1 H 2 )Hj + D x and K = % 2 % 2 - 
H x Hj (H 2 Hj + D 2 )- x . From Lemma C.l we obtain that l(S\\HH T +D) is the 
sum of I(E2a| \H 2 H J + D 2 ) and an expected I-divergence between conditional 
distributions. The latter can be computed according to Equation (A. 2), and 
gives the decomposition result 

X{%\HH T +D) =l{% 22 \\H 2 Hj+D 2 )+l& ll \\S) + ^tv{S- l K% 22 K T }. (D.5) 

The assertion of Lemma 7.5 is then obtained by taking D 2 = and the further 
decomposition of H\ and H 2 as in (7.10). □ 
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